Skip to content

[fix](editlog) Fix BDBEnvironment.close() ConcurrentModificationException#62061

Open
dataroaring wants to merge 1 commit intoapache:masterfrom
dataroaring:fix/bdbenv-close-lock
Open

[fix](editlog) Fix BDBEnvironment.close() ConcurrentModificationException#62061
dataroaring wants to merge 1 commit intoapache:masterfrom
dataroaring:fix/bdbenv-close-lock

Conversation

@dataroaring
Copy link
Copy Markdown
Contributor

@dataroaring dataroaring commented Apr 2, 2026

Summary

  • BDBEnvironment.close() iterates openedDatabases without acquiring lock.writeLock()
  • openDatabase() concurrently modifies the same ArrayList under the write lock
  • When the replayer thread catches RollbackException and calls close(), while HTTP/heartbeat/RPC handler threads call openDatabase() via getMaxJournalId(), ConcurrentModificationException is thrown
  • This kills the replayer thread, causing the follower FE to fall permanently behind the master

Fix: two-phase close pattern

  • Phase 1 (under writeLock): atomically detach resources — copy openedDatabases to local list, clear original, null out epochDB and replicatedEnvironment
  • Phase 2 (outside lock): perform actual db.close() / env.close() without holding the lock, avoiding blocking unrelated threads during potentially slow close operations

Also fixes misleading "will exit" in log messages — close() does not exit the process.

Test plan

  • Verify follower FE handles RollbackException recovery correctly
  • Verify SHOW FRONTENDS works concurrently with journal operations
  • Verify no ConcurrentModificationException in FE logs during role transitions

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 2, 2026 14:29
@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a concurrency bug in BDBEnvironment.close() where iterating openedDatabases without synchronization could throw ConcurrentModificationException during concurrent openDatabase() calls, potentially killing the replayer thread.

Changes:

  • Wrap BDBEnvironment.close() with lock.writeLock() to serialize closes vs. opens/removes.
  • Ensure openedDatabases, epochDB, and replicatedEnvironment are mutated under the same write lock used elsewhere.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +484 to 515
lock.writeLock().lock();
try {
for (Database db : openedDatabases) {
try {
db.close();
} catch (DatabaseException exception) {
LOG.error("Error closing db {} will exit", db.getDatabaseName(), exception);
}
}
}
openedDatabases.clear();
openedDatabases.clear();

if (epochDB != null) {
try {
epochDB.close();
epochDB = null;
} catch (DatabaseException exception) {
LOG.error("Error closing db {} will exit", epochDB.getDatabaseName(), exception);
if (epochDB != null) {
try {
epochDB.close();
epochDB = null;
} catch (DatabaseException exception) {
LOG.error("Error closing db {} will exit", epochDB.getDatabaseName(), exception);
}
}
}

if (replicatedEnvironment != null) {
try {
// Finally, close the store and environment.
replicatedEnvironment.close();
replicatedEnvironment = null;
} catch (DatabaseException exception) {
LOG.error("Error closing replicatedEnvironment", exception);
if (replicatedEnvironment != null) {
try {
// Finally, close the store and environment.
replicatedEnvironment.close();
replicatedEnvironment = null;
} catch (DatabaseException exception) {
LOG.error("Error closing replicatedEnvironment", exception);
}
}
} finally {
lock.writeLock().unlock();
}
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Holding the class write-lock while calling into external resource cleanup (db.close(), epochDB.close(), replicatedEnvironment.close()) can block unrelated threads for a long time and increases deadlock risk if those close paths indirectly contend on the same lock or other shared locks. A safer pattern is a two-phase close: under the write-lock, atomically “detach” resources (copy/swap openedDatabases to a local list, set fields like epochDB/replicatedEnvironment to null, optionally set a closing/closed flag to reject new opens), then release the lock and perform the actual close() calls outside the lock.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Adopted two-phase close: detach resources under writeLock, then close outside the lock. Also fixed the misleading log messages.

try {
db.close();
} catch (DatabaseException exception) {
LOG.error("Error closing db {} will exit", db.getDatabaseName(), exception);
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The message fragment “will exit” is misleading here because close() does not exit the process (and the code continues closing other resources). Consider updating the log text to reflect the actual behavior (e.g., “Error closing db {}” / “Error closing epoch db {}”) to avoid confusion during incident triage.

Copilot uses AI. Check for mistakes.
epochDB.close();
epochDB = null;
} catch (DatabaseException exception) {
LOG.error("Error closing db {} will exit", epochDB.getDatabaseName(), exception);
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The message fragment “will exit” is misleading here because close() does not exit the process (and the code continues closing other resources). Consider updating the log text to reflect the actual behavior (e.g., “Error closing db {}” / “Error closing epoch db {}”) to avoid confusion during incident triage.

Copilot uses AI. Check for mistakes.
…tion with two-phase close

BDBEnvironment.close() iterates openedDatabases without acquiring
lock.writeLock(), while openDatabase() concurrently modifies the same
ArrayList under the write lock. This causes ConcurrentModificationException
when a RollbackException triggers close() on the replayer thread while
HTTP/heartbeat/RPC handler threads call openDatabase() via getMaxJournalId().

Fix with two-phase close pattern:
- Phase 1 (under writeLock): detach all resources — copy openedDatabases
  to a local list, clear the original, and null out epochDB and
  replicatedEnvironment fields. This atomically makes them invisible to
  concurrent openDatabase() callers.
- Phase 2 (outside lock): perform the actual db.close() and env.close()
  calls without holding the lock, avoiding blocking unrelated threads
  during potentially slow close operations.

Also fix misleading "will exit" in log messages — close() does not exit.
@dataroaring dataroaring force-pushed the fix/bdbenv-close-lock branch from a9df648 to 33f8fac Compare April 3, 2026 01:29
@dataroaring
Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/19) 🎉
Increment coverage report
Complete coverage report

@dataroaring
Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 89.47% (17/19) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/19) 🎉
Increment coverage report
Complete coverage report

2 similar comments
@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/19) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 0.00% (0/19) 🎉
Increment coverage report
Complete coverage report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants